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Abstract: Wc consider the problem of estimating the inverse covariance matrix by maximizing the 

Qhkelihood function with a penalty added to encourage the sparsity of the resulting matrix. We propose 
a new approach based on the split Bregman method to solve the regularized maximum likelihood 
, estimation problem. We show that our method is significantly faster than the widely used graphical 

lasso method, which is based on blockwise coordinate descent, on both artificial and real-world data. 
More importantly, different from the graphical lasso, the split Bregman based method is much more 
general, and can be applied to a class of rcgularization terms other than the £i norm. 

' Keywords and phrases: Inverse covariance matrix. Split Bregman, l\ norm. Graphical model. 

, 1. Introduction 

C/3 . Undirected graphical models provide an efiicient way to describe and explain the relationships among a set 

of objects (or variables), and have become a popular tool to model networks of components in a variety 
. of applications, including internet, social networks, and gene networks [1, 2. 3]. A key step involved in 

^ ' constructing a graphical model for a particular application is to learn the structure of the graphical model 

. from a set of observations, which is often called model selection in statistics. If we assume the observations 

' have a multivariate Gaussian distribution with mean ^ and covariance matrix S, learning the structure of a 

(«~^ . graphical model can be reformulated as a covariance selection problem [4], which aims to identify nonzero 

entries of the inverse covariance matrix (also known as concentration matrix or precision matrix). The idea 
behind this formulation is that if the (i, j)-entry of is zero, then the variable i and j are conditionally 
independent, given the other variables [5, 6]. 

The principle of parsimony suggests that among the graphical models that adequately explains the data 
we should select the simplest. Thus, it is natural to impose an £i penalty for the estimation of to promote 
the sparsity of the resulting graph, as has been proposed by a number of authors [7, 2, 8, 9]. 

It has been noted that if we fit a linear regression model to each variable using the others as predictors, the 
regression coefficient of variable j on i will be, up to a positive scalar, equal to the (i, j)-entry of S"^ [10, 6]. 
Based on this observation, Meinshausen and Biihlmann [7] proposed a simple approach to the structural 
learning problem; they estimate a sparse graphical model by fitting a lasso regression model for each variable, 
treating the variable as the response and the others as predictors. The (i,j)-entry of is estimated to 
be non-zero if either the estimated coefficient of variable i on j or the estimated coefficient of variable j on 
i is nonzero (alternatively, one can use an AND rule, requiring both coefficients to be nonzero). Although 
this approach is computationally attractive and has been shown to be able to consistently estimate 
asymptotically, it docs not take the intrinsic symmetry of into account, and could result in contradictory 
neighborhoods and a non-positive definite concentration matrix. Furthermore, if the same penalty parameter 
is used for all lasso regressions, as is commonly done to reduce the number of parameters, the approach will 
not be able to correctly infer graphs with skewed degree distributions. To overcome these limitations, Peng 
et al. [6] proposed a symmetric regression approach; however, the derived concentration matrix can still be 
non-positive-definitc. 

A more principled approach to covariance selection is to find the E"^ that maximizes the log- likelihood 
of the data with an added li penalty [11] to encourage the sparsity of the resulting graph [2, 8, 9]. More 
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specifically, suppose we are given n samples independently drawn from a p-variate Gaussian distribution: 
Xi, . . . , x„ ~ A/'(/i, S). Let S be the empirical covariance matrix: 

1 " 

S := - V(x, - At)(xi - m)^- 
n ^ — ' 

k=\ 

Denote 6 = S^^. The goal is to find the 0* that maximizes the penalized log-likelihood 

logdete-tr(5e)-A^|e,j|, (1) 

subject to the constraint that 8 is positive definite. Here, we use tr to denote the trace and Qij to denote 
the (i,j)-entry of 0. Due to the i\ penalty term and the explicit positive definite constraint on 0, the 
method leads to a sparse estimation of the concentration matrix that is guaranteed to be positive definite. 
The simpler approach of Mcinshausen and Buhlmann [7] can be viewed as an approximation to the penalized 
maximum likelihood formulation [8, 9]. 

The objective function in (1) is strictly convex, so a global optimal solution is guaranteed to exist and be 
unique. Finding the optimal solution can, however, be computationally challenging due to the logdct term 
appeared in the likelihood function and the nondiffcrcntiability of the i\ penalty. Yuan and Lin [2] solved (1) 
using the interior-point method for the "maxdet" problem, which is prohibitive for problems with more than 
tens of variables due to its memory requirements and computational complexity. Banerjee et al. [8] developed 
a blockwise coordinate descent method to solve (1), after noting that the dual of (1) can be reduced to a lasso 
regression problem if one focuses on optimizing only one row or column of fJ. Exploiting this observation 
further, Friedman et al. [9] used the coordinate descent procedure to solve the lasso regression arising in the 
dual of the blockwise coordinate descent, and implemented an efficient method, called "graphical lasso", to 
solve (1). The graphical lasso is remarkably fast; it can solves a p = 1000 dimension (^500,000 parameters) 
problem within a minute. However, a major limitation of the blockwise coordinate descent methods, including 
the graphical lasso, is that their derivation is specific to the l\ penalty. This limitation prevents them from 
being applied to other types of penalties, such as a combination of l\ and I2 penalty used in the elastic net 
[12], the fused lasso penalty [13], and the group lasso penalty [14], which have been found to be more useful 
than the simple i\ penalty in a number of applications. 

In this paper, we propose a new algorithm based on the split Bregman method [15, 16] to solve the 
penalized maximum likelihood estimation problem (1). We show that our method is not only substantially 
faster than the graphical lasso method, but can also be easily extended to deal with a broad range of penalty 
terms. 

Although the Bregman iteration was an old technique proposed in the sixties [17, 18], it gained significant 
interest only recently after Osher and others demonstrated its high efficiency for sparsity recovery in a wide 
range of problems, including image restoration [19, 15, 16], compressed sensing [20, 21, 22], matrix completion 
[23] , low rank matrix recovery [24] , and general fused lasso problems [25] . By introducing an auxiliary variable, 
we show that the optimization problem in (1) can be reformulated such that the log- likelihood term and the 
penalty term interact only through an equality constraint, thereby enabling an efficient application of the 
split Bregman method to solve the optimization problem. 

The rest of the paper is organized as follows. In Section 2, we first present a generalized sparse inverse 
covariance matrix estimation problem, which includes (1) as a special case. We then derive a split Bregman 
algorithm, called SBGM, to solve the generalized problem (Subsection 2.2). The convergence property of 
the algorithm is also given. SBGM consists of three update steps with the first update being the major 
time-consuming one. In Subsection 2.3, we provide an explicit formula for the first update step and propose 
an efficient Newton method to solve the resulting matrix quadratic equation. In Section 3, we implement 
SBGM to solve the special case (1) . In Section 4, we illustrate the utility of our algorithm and compare its 
performance to the graphical lasso method using both simulated data and gene expression data. 

2. The split Bregman method for generalized sparse graphical models (SBGM) 

The split Bregman method was originally proposed by Osher and coauthors to solve total variation based 
image restoration problems [15]. It was later found to be cither equivalent or closely related to a number of 
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other existing optimization algorithms, including Douglas- Rachford splitting [26], the alternating direction 
method of multipliers (ADMM) [27, 28, 15, 29] and the method of multipliers [30]. Because of its fast 
convergence and the easiness of implementation, it is increasingly becoming a method of choice for solving 
large-scale sparsity recovery problems [24, 16, 25]. 

In this section, we first extend the £i regularized maximum likelihood inverse covariance matrix estimation 
problem (1) to allow for a more general class of regularization terms. We reformulate the problem by intro- 
ducing an auxiliary variable. We then proceed to derive a split Bregman method to solve the reformulated 
problem. 

2.1. A general problem and its reformulation 

We derive our algorithm in a more general setting than the one described in (1). Instead of using the £i- 
norm penalty, we consider a more general penalty c/>(0), which is only required to be convex and satisfy 
0(0) = The li norm and a number of other types, including those used in elastic net [12], fused 

lasso [13], and group lasso [14], can be viewed as a special case of the general penalty term. We find 8* by 
solving the following constrained optimization problem 

min-logdete-Htr(S'e)-t-A(/)(e), (2) 

where )^ means that Q is positive definite, and A € is a regularization parameter. If we choose 
4'{®) = X^i^^j |Qy |j the general problem reduces to the problem (1). 

The log-likelihood term and the regularization term in (2) are coupled, which makes the optimization 
problem difficult to solve. However, the two terms can be decoupled if we introduce an auxiliary variable 
to transfer the coupling from the objective function to the constraints. More specially, the problem (2) is 
equivalent to the following problem 

min - log det 8 + tr(S'e) + X(i>{A) 
s.t. A = Q 

9^0. (3) 

The introduction of the new variable of ^ is a key step of our algorithm, which makes the problem 
amenable to a split Bregman procedure to be detailed below. 



2.2. Derivation of the split Bregman method 

Although the split Bregman method originated from Bregman iterations [31, 16, 32], it has been demonstrated 
to be equivalent to the alternating direction method of multipliers (ADMM) [27, 28, 33, 34]. For simplicity 
of presentation, next we derive the split Bregman method using the augmented Lagrangian method [35, 30]. 
We first define an augmented Lagrangian function of (3) 

£(6, A, A/) = - logdet e + tr(5e) + A0(A) +tr(Af'^(e- A)) + |||e-A[||,, (4) 

where matrix M is a dual variable corresponding to the linear constraint Q = A^ and ^ > is a parameter. 
Compared with the standard Lagrangian function, the augmented Lagrangian function has an extra term 
^](0 — A]j|,, which penalizes the violation of the linear constraint Q = A. 

With the definition of the augmented Lagrangian function (4), the primal problem (3) is equivalent to 

min max£(e,A, M). (5) 
e^o.A M ^ ' ^ ' 

Exchanging the order of min and max in (5) leads to the formulation of the dual problem 

maxE(M), with E( M) ^ min C(Q, A, M). (6) 

M e>-o,A 



Note that the gradient VE{M) can be calculated by the following [36] 



VE(M) = Q{M) - A{M), with {<d{M),A{M)) ^ arg min C(Q,A,M). (7) 

Applying gradient ascent on the dual problem (6) and using equation (7), we obtain the method of 
multipliers [30] to solve (3) 

r(efe+i^^fe+i) = argmine^o,A/:(e,AAf^-), .ox 

Here we have used /i as the step size of the gradient ascent. It is easy to see that the efhciency of the iterative 
algorithm (8) largely hinges on whether the first equation of (8) can be solved efficiently. Note that the 
augmented Lagrangian function C{Q, A, M^) still contains a nondifferentiable term (t>{A). But different from 
the original objective function (2), the function (j) induced nondifferentiable term has now been transferred 
from terms involving to terms involving A only. Thus we can solve the first equation of (8) through an 
iterative algorithm that alternates between the minimization of and A, 



Qk+i ^ argmine^o - logdet 8 + tr(5'e) + tr((Af*=)^(e - A^)) + f ||e - A 
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|yl'^+i=argminAA0(A)+tr((A/'^)^(e'^+i-A)) + f||e'=+i-Ai||,. ^ ' 

The method of multipliers requires that the alternative minimization of and A in (9) be run multiple 
times until convergence. However, because the first equation of (8) represents only one step of the overall 
iteration, it is actually not necessary to be solved completely. In fact, the split Bregman method (or the 
alternating direction method of multipliers [27]) uses only one iteration of (8), which leads to the following 
iterative algorithm for solving (3), 

'e'^'+i = argmine^o - logdet e + tr(S'e) +tr((M'^')'^(e - A^)) + f ||e - 

= argmin^ A0(A) + f jje'^+i - A + ^-^A/^IH, (10) 
M'^+i = A/*^ + /i(e'^+i - A'^+i). 



2.2.1. Convergence Property 

The convergence of the iteration (10) can be derived from the convergence theory of the alternating direction 
method of multipliers or the convergence theory of the split Bregman method [27, 37, 16]. 

Theorem 1. Let he generated by (10), and Q* be the unique minimizer of (3). Then, 

lim ||e'^-e*|| = 0. 

k^oo 

From Theorem 1, the condition for the convergence of the iteration (10) is quite mild and even irrelevant 
to the choice of the parameter /i in the iteration (10). This property makes the split Bregman method quite 
general and easy to implement, which partly explains why the method is gaining popularity recently. 



2.2.2. Updating 8 and A 

We first focus on the computation of the first equation of (10). Taking the derivative of the objective function 
and setting it to be zero, we get 

-Q-^ + fiQ^ ^lA^ - S- M^. (11) 

It is a quadratic equation where the unknown is a matrix. The complexity for solving this equation is at 
least 0{p^) because of the inversion involved in (11). Note that because 0(0) = 4){Q^), if is symmetric. 
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so is fiA'' — S — . It is easy to check that the exphcit form for the solution of (11) under constraint ;^ 0, 
i.e., 9''+^, is 

^ i^NVCgF+M, where = ^lA'^ - S - MK (12) 

Here \fC is the square root of a symmetric positive definite matrix C . Recall that the square root of a 
symmetric positive definite matrix C is defined to be the matrix whose eigenvectors are the same as those 
of C and eigenvalues are the square root of those of C. Therefore, to get the update of 8*^+^, one may first 
compute the eigenvalues and eigenvectors of , and then get the eigenvalues of 0'"'+^ according to (12) 
by replacing the matrices by the corresponding eigenvalues. This approach is rather slow due to the eigen 
decomposition step. In the next subsection, we will propose a faster algorithm for solving (11) based on 
Newton's iteration [38, 39]. 

For the second equation of (10), we have made the data fitting term ^|j8''+^ — A + ^~^M'^|j|, separable 
with respect to the entries of A. Thus, if the 4>{A) is separable, then it is very easy to get the solution 
and the computational complexity would be 0{p^) for most cases. See Section 3 for the special case of 
(j){A) = '^i=ij Thus, compared with the update of 8, the computational time for updating A is minor 

and mostly negligible. The computation of the third equation of (10) is straightforward and the computational 
cost is also 0{p^), which can be neglected as well. So the overall computational time for running one iteration 
of (10) mostly depends on how fast 8 can be updated. 



2.3. Newton's method to calculate the square root of a positive definite matrix 

As described above, the main step to obtain an update of 8*^+^ is calculating the square root of the positive 
definite matrix [K^Y + where is a symmetric matrix. One possible way is to calculate all the 
eigenvalues and eigenvectors of matrix ^ which is computational demanding when the size of the matrix 
is large. In this subsection, we will use Newton's method [39, 38] to calculate the square root of a positive 
definite matrix directly instead of using the eigenvalue decomposition of . Our numerical experiments show 
that Newton's method for calculating the square root of a positive definite matrix of the form + 4/^/, 
where K is symmetric, is about four times faster than the one using eigen decomposition. 

We begin with finding the positive root of the equation — a = 0(a > 0) using Newton's method [40, 41]. 
That is, 

^"^"-\^-" + ^)- (13) 

The following lemma ensures that (13) with x*^ > always converges to \/a and the convergence rate is 
quadratic. 

Lemma 1. If we choose > 0, then generated by (13) is well-defined and satisfies 

\x''+^ - Va\< mini - ^/af,l-\x'' - 

1^ 2y/a ' ' 2 ' 

Proof. Since x° > 0, x^ is well-defined and x^ = ^{x^ + ^) > ^/a. By induction, x'' is well-defined and 
x'^ > y/a. Moreover, using the iteration (13), 



and 



1 / fc " - 



□ 
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Now we apply Newton's method to calculate the square root of a positive definite matrix of the form 
+ al, where K is symmetric and a is a positive constant. Let = y/al and 

X'=+i = i(X'= + (X'=)-i(A'2 + a/)). (14) 

Then the iteration (14) always converges quadratically to V I-C^ + al. Similar algorithms have been proposed 
in [42, 43, 41] to compute polar decompositions. 

Theorem 2. If we choose — y/al, then X'^ generated by (14) is well-defined and quadratically converges 
to y/lC^ + al . More specifically, 

WX'^ VK^ + alh < mini L.= =\\X'^ ^ + al\\l - + alh] , (15) 

[ 2A/a + Amin(ii-^) ^ J 

where Aniin(^^) is the minimum eigenvalue of matrix K. 

Proof. Let K = UVlU'^ with Vt = diag(cJi, . . . ,ujp) and UU'^ = U'^U = I. Since X° = ^a/, by induction, 
X^ can be written as X'^ = Uft^U'^ with il'' = diag(a;j|^, . . . ,u!p) and > 0. Furthermore, 



X 



k+l 



C/Q(0^ + (r!^)-i(r!2+a/))^ U^. 



Therefore, (14) changes only the singular values which are governed by (13). The Theorem follows immedi- 
ately from Lemma 1. □ 

Note that the choice of the initial guess X° can be any matrix of the form UAU'^, where U is the 
eigenvectors of K and A is a diagonal matrix with positive diagonal. In the above, we have chosen = y/al 
for simplicity. 

Combining the iteration (10) and Newton's method to calculate the square root of the positive matrix of 
the form + Ajil where K is symmetric, we get Algorithm 1 to compute the solution of (2). 

Algorithm 1 Split Bregman method for generalized graphical models (2) (SBGM) 

Given ^. Initialize 0°, A^, and M°. 
repeat 

1) Compute = fiA^' ~ S ~ M'' 

2) Use Newton method to compute X*" = ^ (K'^Y + 4^-^ 

3) 6''+^ = ^ 

4) A'^+i = argmin^ M^A) + ^||0 - A* ^ii-^M^\\\ 

5) Af'^+i = M*^ + /i(0'=+i - A'-'+i) 
until 

Convergence 



3. The Split Bregman method for £i penalized graphical models (SBGLasso) 

In this section, we describe a detailed implementation of the split Bregman method to solve the l\ penalized 
inverse covariance matrix estimation problem (1). The i\ penalty corresponds to a special case of the general 
penalty used in (2) with 0(6) = X^i^^j 

Algorithm 1 can be applied here with only minor changes. The updates of and and M are exactly the 
same as the ones in Algorithm 1. For the update of A, let 7a be a soft thresholding operator defined on 
matrix space and satisfying 

T\(Sl) = = 1, with txiiOij) = sgn(wy ) max{0, |a;„| - A}. 

i ^ i 

Then the update of A is 

^'^■+1 = ri(e'^+i + Ai"^M'=). 

So we obtain Algorithm 2 to solve (1). 
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Algorithm 2 Split Brcgman method for £i penalized graphical models (SBGLasso) 



Initialize e'\A°, and M^. 
repeat 

1) Compute = iiA^ ~ S ~ M'' 

2) Use Newton method to compute X*' = yj {K^)'^ + i/il 

3) 6'=+^ = ^ 

4) A'^+i = TaCO'-'+i +^-1M*) 

5) M'^+i = M*^ + /i(e'^+i - A'^+i) 
until 

Convergence 



4. Numerical experiments 

hi this section, we use time trials to illustrate the utility of our proposed algorithms. We first demonstrate 
the efficiency of the proposed Newton's method for calculating the square root of a positive definite matrix 
by comparing it with the eigenvalue decomposition method. We then benchmark the performance of the split 
Brcgman method (SBGLasso) for solving the £i penalized inverse covariance matrix estimation problem, and 
compare it to the graphical lasso method. The time trials were all conducted on an Intel Core 2 Duo desktop 
PC (E7500, 2.93GHz). 

4.I. Newton's method versus eigenvalue decomposition method for computing the square root 
of a positive definite matrix 

We first illustrate the efficiency of the Newton's method for computing the square root of a positive definite 
matrix of form M = K'^ + fj.I, where K = {B + B')/2 and B is a p X p random Gaussian matrix with its entries 
randomly drawn from the standard Gaussian distribution. The form of the matrix M we choose is mostly 
motivated by the square root we need to solve in the equation (12). Our algorithm is implemented in Matlab 
using mex programming. We compare it with the method using matlab function "schur" on the matrix K. 
More specifically, we use the two-line code "[[/, D] ~ schur{K); SR ~ U * diag{sqrt{diag{D) .'2 + fi)) * [/"' 
to compute the square root of M. Note that the upper triangular matrix produced by schur decomposition 
is exactly diagonal since K is symmetric. 

In Table 1, we have listed the number of iterations needed for our Newton's method to solve the square 
root of M, where we stop the iterations whenever the relative change of two successive steps is less than 
10~^. We also calculated the relative difference between the result derived by our method and the one by 
schur decomposition. As shown in Tabic 1, the relative difference is of precision of order 10~^, demonstrating 
that our algorithm is highly accurate. Empirically, we find that our algorithm only needs 8 steps to converge 
and the number of steps is mostly constant as the matrix size increases, which illustrates the quadratic 
convergence rate of our algorithm given in Theorem 2. Overall, our algorithm is substantially faster than the 
schur decomposition based method, saving more than 75% of the computational times on all examples we 
tested. For example, when n = 2000, our algorithm takes 7.45 seconds to find a solution, significantly lower 
than the 38.08 seconds used by the method based on schur decomposition. 

Table 1 

Experimental results for computing the square root of of M . All results are averages of 10 runs. 





Newton method 


Schur decomposition 


relative 


p 


iters in Newton method 


total timc(s) 


total time(s) 


difference 


1000 


8 


1.20 


4.54 


9.69 X 10"^* 


1500 


8 


3.35 


16.11 


8.54 X 10"" 


2000 


8 


7.45 


38.08 


3.09 X 10-^^ 


2500 


8 


13.56 


76.03 


3.75 X 10-^" 


3000 


8 


22.43 


127.32 


2.35 X lO"'' 
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4-2. SBGLasso versus the graphical lasso for £i penalized graphical models 

Next wc illustrate the efRciency of the split Bregman method for £i penalized maximum likelihood graphical 
models (SBGLasso) using time trials. The graphical lasso proposed by Friedman et al. [9] is by far the most 
efficient method for solving the large-scale inverse covariance matrix estimation problem (f), so we focus our 
comparison with the graphical lasso method. SBGLasso is coded in C, linked to a Matlab function, while the 
graphical lasso is coded in Fortran, linked to an R language function, so it is reasonable to compare them 
using time trials despite that they are implemented in two different languages. 

Thestoppingcriteriaof SBGLasso is specified as follows. Let $(6'^) = - logdet 8'""+tr(5'9'^)+A 
Since the ii penalized maximum likelihood graphical model (1) is a special case of model (2), we have 
limfc^oo $(0'^) = ^'(B*) by Theorem 1. We terminate the algorithm when the relative change of the 
energy functional $(8) falls below a certain threshold S. In addition, we require the relative difference 
||9 — ^||F/||0||Ftobe less than S to make sure that the resulting solution is also primal feasible. We used 
5 = 10~^ in our simulation. For the graphical lasso, we also set the termination threshold to be 10~^. 

Note that the convergence of Algorithm 2 is guaranteed no matter what values of fj, are chosen as shown 
in Theorem 1. However, the choice of fi can influence the speed of the algorithm since it could affect the 
number of iterations involved. In our implementation, we found empirically that choosing fi = 0.5 works well 
for the artificial data and n — 0.012 for the gene expression data. 

4- .2.1. Artificial data 

To generate the artificial data, we first create a set of sparse inverse covariance matrices with different 
dimension p, and then generate n samples from the multivariate Gaussian distribution parametrized by 
each of the inverse covariance matrices. The sparse inverse covariance matrices are created using the same 
procedure as described in [8]. More specially, to create an inverse covariance matrix, we first generate a 
random p x p diagonal matrix with positive entries, and then symmetrically insert random numbers to 
approximately p randomly chosen locations of the matrix. Positive definiteness is ensured by adding a 
multiple of the identity to the matrix if needed. The penalty parameter is chosen such that the estimated 
inverse covariance matrix has roughly the same number of nonzero entries as the actual inverse covariance 
matrix. 

Table 2 shows a comparison of SBGLasso and the graphical lasso on both computational times as well 
as relative errors after testing on the artificial data. We note that SBGLasso is significantly faster than the 
graphical lasso for large-scale data {p > 500), while being able to achieve the same levels of accuracy. It takes 
less than half of the computational times used by the graphical lasso for all cases we tested except when 
p = 500. For example, when n = 2000, p = 3000, SBGLasso takes only 269.53 seconds to find the optimal 
solution, compared to the 729.63 seconds used by the graphical lasso. 

To evaluate how the performance of SBGLasso scales with the problem size, we plotted the CPU time 
that SBGLasso took to solve the problem (1) as a function of p and n. The CPU time is roughly quadratic 
in p and constant in n. This phenomena is expected since each of the concentration matrices used in the 
artificial data has about ^^^^^^ unknowns , which is quadratic with respect to p. By contrast, the number of 
samples only appears in S, and does not directly affect the speed of SBGLasso. 



Table 2 

Comparing the performance of SBGLasso and the graphical lasso on the artificial data 



n, p 


SBFGLasso 


Graphical Lasso 


total time (s) 


relative error 


total time(s) 


relative error 


n = 1000, p = 500 


2.61 


0.1136 


1.94 


0.1136 


n = 1000, p = 1000 


14.09 


0.1192 


27.08 


0.1193 


n = 1000, p = 2000 


88.71 


0.1170 


217.88 


0.1170 


n = 2000, p = 3000 


269.53 


0.0980 


729.63 


0.0979 
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Fig 1: CPU times for SBGLasso for the same problem as in Table 2, for different values of n and p. (a) n is fixed 
and equals to 1000; (b) p is fixed and equals to 2000. 

4- 2. 2. Gene expression data 

We next apply SBGLasso to learn gene regulatory networks from microarray gene expression data. We 
used the Rosetta Inpharmatics compendium of gene expression profiles described by Hughes et al. [44]. The 
compendium contains microarray measurements of the expressions of 6316 genes in response to 300 diverse 
mutations and chemical treatments in S. cerevisiae. We preprocessed this data by removing genes that either 
contain missing values in some experiments, or have low variance across different experimental conditions. 
This led us to a final dataset of p = 1000 genes with measurements across n — 300 experiments. 

We used SBGLasso to solve the sparse graphical model (1) on this data, and compared its performance to 
the graphical lasso [9]. The results are summarized in Table 3, which shows the computational times spent by 
different solvers for different choices of the regularization parameter A. SBGLasso consistently outperforms 
the graphical lasso, saving more than 40% time in all cases. The concentration matrices derived by SBGLasso 
and the graphical lasso are very similar. Table 3 also lists the values of the objective function achieved when 
the algorithms converge, showing that the results from the two solvers are almost identical. 

Table 3 

Comparing the performance of SBGLasso and the graphical lasso on the gene expression data 



A 


SBFGLasso 


Graphical Lasso 


total time (s) 


energy value 4>(0) 


total time(s) 


energy value $(0) 


A = 0.01 


89.10 


-2.96 X 10'' 


167.23 


-2.96 X 10'' 


A = 0.015 


59.95 


-2.78 X 10-' 


147.92 


-2.79 X 10'' 


A = 0.02 


54.41 


-2.72 X 10'' 


116.18 


-2.72 X 10'' 


A = 0.025 


52.94 


-2.69 X 10'' 


87.74 


-2.69 X 10'' 



5. Discussion 

The problem of inverse covariance matrix estimation arises in many applications. Because the number of 
available samples is usually much smaller than the number of free parameters associated with the inverse 
covariance matrix, a parsimony approach is to select the simplest matrix that adequately explains the 
data. This leads to the idea of formulating the problem as a regularized maximum likelihood estimation 
problem with a penalty added to encourage the sparsity of the resulting matrix. Solving the regularized 
maximum likelihood problem is however nontrivial for large-scale problems, because of the complexity of the 
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log-likelihood term and the nondifferentiability of the regularization term. In this work, we propose a new 
approach based on the split Bregman method to solve the regularized inverse covariance matrix estimation 
problem. We show that the approach is very fast; it solves a 3000 x 3000 matrix estimation problem in less 
than 5 minutes. 

We compared the split Bregman method to the graphical lasso method, the state-of-the-art for solving 
£i penalized inverse covariance matrix estimation problems [9]. We show that our method is about twice 
faster than the graphical lasso for large-scale problems on both the artificial data and the gene expression 
data, while being able to achieve the same levels of accuracy. More importantly, our method is much more 
general and can handle a variety of sparsity regularization terms other than the £i norm, such as those used 
in elatistic net, fused lasso and grouped lasso. By contrast, the graphical lasso can only be applied to the £i 
norm penalty. 

The contribution of the paper lies in two aspects. First, we reformulated the regularized maximum like- 
lihood estimation problem by introducing an auxiliary variable, which leads to the decoupling of the log- 
likelihood term and the regularization term, making the problem amenable to the split Bregman method. 
Second, we proposed to use Newton's method to calculate the square root of a positive definite matrix, 
the major time consuming step of our algorithm. Because the matrix quadratic equation is separable in its 
eigenvalues, the conventional solver is first to compute the eigen decomposition and then solve univariate 
quadratic equations about the eigenvalues. This approach is generally slow, as the eigen decomposition is 
time-consuming. To accelerate it, we propose to use a matrix iterative algorithm [39, 38] to directly solve the 
matrix quadratic equation based on Newton's method, without using the eigen decomposition. The contri- 
bution is crucial and makes the derived algorithm from the split Bregman method very easy to implement 
and highly efficient. 
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